Improving OCR Accuracy for Classical Critical Editions

Authors

  • Federico Boschetti
  • Matteo Romanello
  • Alison Babeu
  • David Bamman
  • Gregory R. Crane
Abstract

This paper describes a workflow designed to populate a digital library of ancient Greek critical editions with highly accurate OCR text. Although recent OCR engines can, after suitable training, handle the polytonic Greek fonts used in 19th- and 20th-century editions, post-processing yields further improvements. In particular, the paper discusses a progressive multiple alignment method applied to different OCR outputs produced from the same page images.
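The abstract does not spell out the alignment procedure, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes three OCR outputs for the same line, aligns them progressively (each new output is aligned against the columns built so far with a simple Needleman-Wunsch pass), and then takes a per-column majority vote. The function names (`align_to_profile`, `progressive_align`, `consensus`), the +1/-1 scoring, and the tie-breaking rule are all illustrative assumptions.

```python
from collections import Counter

GAP = None  # gap marker; None cannot collide with a real character


def align_to_profile(columns, seq):
    """Align string `seq` against existing alignment columns (Needleman-Wunsch).

    Each column is a list holding one entry per already-aligned output.
    Assumed scoring: +1 if the character occurs in the column, -1 otherwise,
    and -1 per gap.
    """
    n_prev = len(columns[0]) if columns else 0
    rows, cols = len(columns), len(seq)

    def score(col, ch):
        return 1 if ch in col else -1

    # dp[i][j] = best score aligning the first i columns with the first j chars
    dp = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        dp[i][0] = -i
    for j in range(1, cols + 1):
        dp[0][j] = -j
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + score(columns[i - 1], seq[j - 1]),
                dp[i - 1][j] - 1,  # gap in the new sequence
                dp[i][j - 1] - 1,  # gap in all earlier sequences
            )

    # Trace back from the bottom-right corner to rebuild the columns.
    out = []
    i, j = rows, cols
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and dp[i][j] == dp[i - 1][j - 1] + score(columns[i - 1], seq[j - 1])):
            out.append(columns[i - 1] + [seq[j - 1]])
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] - 1:
            out.append(columns[i - 1] + [GAP])
            i -= 1
        else:
            out.append([GAP] * n_prev + [seq[j - 1]])
            j -= 1
    out.reverse()
    return out


def progressive_align(outputs):
    """Add OCR outputs one at a time to a growing multiple alignment."""
    columns = [[ch] for ch in outputs[0]]
    for seq in outputs[1:]:
        columns = align_to_profile(columns, seq)
    return columns


def consensus(outputs):
    """Majority vote per column; a winning gap drops the column.

    Ties go to the earliest output in the list (an arbitrary assumption).
    """
    chars = []
    for col in progressive_align(outputs):
        winner = Counter(col).most_common(1)[0][0]
        if winner is not GAP:
            chars.append(winner)
    return "".join(chars)
```

For example, `consensus(["abcd", "abd", "abcd"])` recovers the character that the second output dropped, because two of the three aligned outputs agree on it. A real pipeline would likely weight engines by reliability and handle combining diacritics explicitly; neither is attempted here.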

Similar resources

National Library of Australia

This article details the work undertaken by the National Library of Australia Newspaper Digitisation Program on identifying and testing solutions to improve OCR accuracy in large scale newspaper digitisation programs. In 2007 and 2008 several different solutions were identified, applied and tested on digitised material now available in the Australian Newspapers Digitisation Program beta service...

Important New Developments in Arabographic Optical Character Recognition (OCR)

Leipzig University’s (LU) Alexander von Humboldt Chair for Digital Humanities—has achieved Optical Character Recognition (OCR) accuracy rates for classical Arabic-script texts in the high nineties. These numbers are based on our tests of seven different Arabic-script texts of varying quality and typefaces, totaling over 7,000 lines (~400 pages, 87,000 words; see ​Table 1​ for full details). The...

Improving OCR Performance in Biomedical Literature Retrieval through Preprocessing and Postprocessing

Today’s information retrieval (IR) techniques are mostly text-based. As a consequence, some types of information are beyond the reach of text-based IR systems, which fail in situations where textual information cannot be easily accessed, e.g. textual information in biomedical images and figures. To tackle such situations, we propose to augment IR systems with the ability to perform optical cha...

Greek and Latin corpora with variants and conjectures : Mapping critical apparatuses onto reference text

The principal corpora currently available in classical literature, while quite thorough, are based on authoritative editions without critical apparatuses. However, philologists need to deal with textual variants attested by manuscripts and conjectures suggested by scholars through the centuries. This paper will explore some methods for information extraction applied to digitised apparatuses of ...

How to Face the Crisis of Legitimacy: The Transfer and Further Development of Methods of Access from Printed to Digital/Digitised Editions

All media provide media specific methods of access to information and therefore media change affects also these methods of access. But the change of media and hence access methods also raises the question of legitimacy of doing this, in terms of scholarly working as well as in terms of justification in the face of the funding general public financing research either directly by the government o...

Publication date: 2009